Helga Sigríður Thordersen Magnúsdóttir s202027
Hlynur Árni Sigurjónsson s192302
Katrín Erla Bergsveinsdóttir s202026
Kristín Björk Lilliendahl s192296


Business Question 2 - Market Analysis

In order for all the figures to show up the notebook needs to be trusted: File -> Trust notebook

Where should we open up a new Pizza place, who should we invite to the opening and what should we keep in mind?

We will continue to work as consultants for a company that wants to open a pizza restaurant in Copenhagen. From Business Question 1, we now know what people like about pizza restaurants, what customers' favorite features about the restaurants are and what they complain about. Further text analysis has been performed and we will build on those findings.

In this business question we will focus on generating a detailed market analysis of the pizza market in Copenhagen, in order to answer the business question: Where should we open up a new Pizza place, who should we invite to the opening and what should we keep in mind? First, the scope of the market will be defined broadly, to include all restaurants that offer pizzas. We will gather additional data about the different areas (neighbourhoods) in Copenhagen, and use it to visualize the distribution and evolution of pizza places. The popularity of the pizza places will be estimated, and the reviews of each place will be used to generate both TF-IDF scores and a sentiment score for each place. Those will then be used to evaluate the differences between the pizza places in different areas. A network of the market and the reviewers will be generated in the hope of gaining insights into the relationship between restaurants and reviewers. Based on that network, the most influential reviewers will be identified to compile a short list of influencers that could be invited to a grand opening.

Some of the market questions answered include:

Finally a recommendation will be made on where a new Pizza place should be opened, who should be invited to the opening and what should be kept in mind before opening the place.




Imported data 🐼

The first step as always is to install and import the necessary packages.


1. Gather data 🇩🇰

The first step is as always to gather the data needed to perform the analysis. The data was prepared and pickled in the Milestone 1 notebook for further data analysis. Let's start by reading those in.

However, the shapefiles include all municipalities in Denmark, without much detail on the different neighbourhoods in Copenhagen. Each municipality is plotted in a separate color based on its ID in the dataset.

To change that, an additional shapefile dataset has to be downloaded and the two geodatasets merged. Open Data DK has a shapefile dataset available that contains the different areas (bydele) of Copenhagen. The data was downloaded from their webpage. The analysis will focus on the Copenhagen area, as it was clear from the Milestone 1 notebook that most of the restaurants are located there.

Since Frederiksberg is enclosed by Copenhagen, it will be included in the analysis, by using the data from the previously gathered shapefiles and merging it with the data about the Copenhagen areas. Additionally, the information about the area of Frederiksberg was gathered from Wikipedia and added to the geopandas dataframe and the results then plotted.

The different areas of Copenhagen can now be seen on the shapefile, with Frederiksberg included. These areas will be the main focus of the analysis. The geopandas dataframe also includes information about the area (m^2) of each neighbourhood as well as the population of each one. These will come in handy later on.


2. Data exploration and enrichment 🧐

Some further data exploration and enrichment will be performed, on top of what was already done in the Milestone 1 notebook.

As preparation for the business questions to be answered later on, and as further data exploration, the following will be investigated:

2.1 Location data📍

Mission: Use generated shapefile to find out which area of Copenhagen each restaurant is located in. Filter dataset on those restaurants.

Before it is possible to use the latitude and longitude data, it needs to be converted to a float value. The number of reviews that each restaurant has is also converted into an integer at the same time, just for convenience.

Then a function can be defined that takes as an input lat and lon data, along with a geopandas dataframe. It checks if the lat and lon are provided. If both are None then based on the cell above, it is clear that there is no location data provided. If neither are None then the lat and lon data is used to create a Shapely type Point from the data. That point is then compared with all the polygons in the shapefile, and the name of the area that contains the point is returned.
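The lookup described above can be sketched roughly as follows. This is a dependency-free illustration only: the notebook works with Shapely points and the polygons of the geopandas dataframe, whereas this sketch swaps in a plain ray-casting test on coordinate lists. The names `point_in_polygon`, `find_area` and `areas`, as well as the two toy square polygons, are assumptions made for the example.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: True if (lon, lat) lies inside the polygon.

    polygon is a list of (lon, lat) vertices in order.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_area(lat, lon, areas):
    """Return the name of the area containing the point, or None."""
    # No usable location data for this restaurant
    if lat is None or lon is None:
        return None
    for name, polygon in areas.items():
        if point_in_polygon(lon, lat, polygon):
            return name
    # The point is outside every known area
    return None


# Hypothetical areas: two unit squares side by side
areas = {
    "Area A": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "Area B": [(1.0, 0.0), (2.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}

print(find_area(0.5, 0.5, areas))
print(find_area(None, None, areas))
```

Returning None both for missing coordinates and for points outside all polygons makes it easy to filter out restaurants that are not in the Copenhagen area afterwards.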

Then the function can be run on the whole restaurants dataframe and a new column called area generated.

We now have information in the restaurants dataframe about which area each restaurant is located in. Around 20% of the restaurants in the dataset are located outside of the Copenhagen capital area.

Let's create a new dataset that only includes the restaurants within the Copenhagen area. The dataset contains information about 1889 restaurants.

To get a sense of the distribution of restaurants, each restaurant was plotted (in white) on top of the different areas of the capital area. As previously, each area is plotted in a different color according to its ID.

Now it can clearly be seen that the majority of the restaurants are located in Indre By and the neighbouring areas, such as Vesterbro, Nørrebro and Østerbro.

Conclusion: We now have information about which area each restaurant is located in and have filtered the dataset to only include restaurants in the Copenhagen capital area.

2.2 Review text clean up 🧼

Mission: Preprocess the review text, and generate term frequencies for each review and combine them for each restaurant.

Before raw text can be used for analysis, it needs to be cleaned up. The first step is to read in all the reviews and do some necessary type conversion. Since additional reviews were added after the Milestone 1 notebook, these reviews will be read directly from the source, and the ratingDate converted to a datetime object.

In total there are 109,749 reviews, but since there appear to be some duplicates from the data extraction, those will be dropped using pandas drop_duplicates.

After dropping duplicates there are 89,200 reviews in the dataset.

Then the nltk package will be used for preprocessing the data. A function was created that does the following text clean up:

The functionality of the function is then tested to show that it is working properly before proceeding. It's important to note that some of the common stopwords include negation words, such as 'not' so the whole context of the review can change when the text is processed in this way.

After applying the function to the whole dataframe and generating a new column, the reviews dataframe is as follows. For the text clean up, both the reviewHeader and reviewText were used as a whole, to include the total review text.

The dataset now includes information about the clean text of each review.

Create Term Frequencies for each review

Using the cleaned text, a term frequency distribution was generated for each review, using the FreqDist function from nltk. It generates a special type of dictionary, where each word is the key and the frequency is the corresponding value.

Create combined Term Frequency for each restaurant

Now it is possible to sum the term frequencies of the reviews into a single term frequency distribution for each restaurant. This is done by grouping the reviews by the restaurant name and summing the term frequencies together.
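To make the two steps concrete, here is a small sketch. It uses `collections.Counter` as a stand-in for nltk's FreqDist (which behaves like a counter of words), and the review data and restaurant names are made up for the example.

```python
from collections import Counter, defaultdict

# Hypothetical cleaned reviews: (restaurant name, list of cleaned tokens)
reviews = [
    ("Place A", ["great", "pizza", "great", "service"]),
    ("Place A", ["pizza", "cold"]),
    ("Place B", ["lovely", "pasta"]),
]

# Term frequency per review: word -> number of occurrences
review_tf = [(name, Counter(tokens)) for name, tokens in reviews]

# Sum the per-review counters into one counter per restaurant
restaurant_tf = defaultdict(Counter)
for name, tf in review_tf:
    restaurant_tf[name].update(tf)  # update() adds the counts together

print(restaurant_tf["Place A"].most_common(2))
```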

Create a normalized Term Frequency for each restaurant

Since each restaurant has a different number of reviews, and the reviews are of different lengths, a new column called tf_norm is generated, where each tf is normalized based on the number of words in the reviews of that restaurant. That way it is easier to see the relative frequency of each word in the reviews.
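A minimal sketch of the normalization, assuming that each count is simply divided by the total number of counted words for that restaurant. The function name `normalize_tf` is hypothetical, and `Counter` again stands in for nltk's FreqDist.

```python
from collections import Counter


def normalize_tf(tf):
    """Divide each count by the total number of words counted."""
    total = sum(tf.values())
    if total == 0:
        # No words at all, so there is nothing to normalize
        return {}
    return {word: count / total for word, count in tf.items()}


# Hypothetical combined term frequency for one restaurant
tf = Counter({"pizza": 3, "great": 1})
tf_norm = normalize_tf(tf)
print(tf_norm)
```

The normalized values sum to one for each restaurant, so they can be compared between places with very different numbers of reviews.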

Conclusion: Now the term frequencies of each review and combined term frequencies for each restaurant are ready to be used.

2.3 Find all Pizza restaurants 🍕

Mission: Find all the restaurants that sell Pizza, and filter the dataset on those.

In order to filter the dataset to only include the restaurants that offer Pizza, three things will be used:

The first step is to find all the restaurants that have a review that includes the word 'pizza' and mark those as having a review that includes pizza.

The second step is to check if the Cousine Type of the restaurant is 'Pizza' or if the name of the restaurant has the substring 'pizz'. After combining those and using as a filter, the results will be a dataframe filtered to only include the restaurants that sell pizzas.
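The filter could look roughly like the sketch below. It assumes that a restaurant is kept if any one of the three checks is true; the notebook may combine them differently. The field names `name` and `cuisine_types`, the helper `offers_pizza` and the example data are all hypothetical.

```python
def offers_pizza(restaurant, review_texts):
    """True if any of the three pizza indicators applies (assumed OR logic)."""
    in_review = any("pizza" in text.lower() for text in review_texts)
    is_pizza_type = "pizza" in [c.lower() for c in restaurant["cuisine_types"]]
    in_name = "pizz" in restaurant["name"].lower()
    return in_review or is_pizza_type or in_name


# Hypothetical restaurants and their review texts
restaurants = [
    {"name": "Pizzeria Uno", "cuisine_types": ["Italian"]},
    {"name": "Sushi Bar", "cuisine_types": ["Japanese"]},
    {"name": "Cafe Nord", "cuisine_types": ["Cafe"]},
]
reviews_by_name = {
    "Pizzeria Uno": [],
    "Sushi Bar": ["Fresh fish"],
    "Cafe Nord": ["The pizza was surprisingly good"],
}

pizza_places = [
    r["name"]
    for r in restaurants
    if offers_pizza(r, reviews_by_name.get(r["name"], []))
]
print(pizza_places)
```

Matching on the substring 'pizz' catches names such as Pizzeria as well as Pizza, while the review check catches places that never mention pizza in their name or category.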

Conclusion: So now the dataset has been filtered down to 305 restaurants, that all seem to offer pizzas, even if they don't specifically advertise themselves as pizzerias.

2.4 Review Sentiment 👍

Mission: Analyse the sentiment of each review and generate the average review sentiment of each restaurant.

The sentiment of the review text should, in theory, be very telling of how the customers' experience at the restaurant was. In order to compute the sentiment of a review, the NLTK package was used. It contains the SentimentIntensityAnalyzer which computes the positivity/neutrality/negativity of text as explained in the documentation.

It's important to note that the raw review is used as input to the function. The reason is that any cleanup performed beforehand could have a significant impact on the results. As an example:

If this sentence were cleaned up and common stop words removed, then the not would be removed and the sentence would become:

The Sentiment Intensity Analyzer has ways of handling negation such as "not good" and therefore it is important to keep the original context of the review text when evaluating the sentiment.

It's also important to include the text from reviewHeader in the sentiment analysis, as that can be considered to be an important part of the review.

In order to make the analysis quicker, the reviews dataset is filtered to only contain the reviews relating to the set of restaurants offering pizza, as extracted in the previous section 2.3 Find all Pizza restaurants. Then the sentiment analysis was performed and a new column generated in the reviews dataset.

In order to inspect the generated sentiments, all the sentiment scores were plotted on a histogram by review rating.

From the above histograms it can clearly be seen that reviews with a rating of 4 or 5 almost exclusively have a positive sentiment. The majority of reviews with a rating of 3 also have a positive sentiment, although it is slightly lower than for the higher ratings. Reviews with a rating of 2 have fairly balanced sentiment scores, while the majority of reviews with a rating of 1 have a negative sentiment.

In some cases the discrepancies can be explained by reviews that include positive words which are not applied to the restaurant in question, such as in the review below: "There are many other wonderful Italian restaurants in Copenhagen as we have discovered. Walk pass La Rocca and find a friendlier and more welcoming place"

Generate average review sentiment for each restaurant

Finally the review sentiments were aggregated, and the average review sentiment of each restaurant generated.

Conclusion: Now the average sentiment score of each restaurant is available in the dataset.

2.5 Opening date/Closing date 🎊

Mission: Use review dates to estimate opening date/closing date of each restaurant, using the oldest and newest review dates.

The dataset does not include information about the opening date/closing date of each restaurant, but it might be very valuable information for further analysis.

In order to estimate these dates, the date of reviews will be used. The opening date will be estimated based on the date of the oldest review, while the closing date will be estimated based on the date of the most recent review. If the most recent review date is around mid December 2020 or after that, then the restaurant can be assumed to still be in business.
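As a rough sketch of this estimate: the cutoff of 15 December 2020 is our assumption for "around mid December 2020", and the function name `estimate_open_close` and the example dates are hypothetical.

```python
from datetime import date

# Assumed cutoff for "around mid December 2020"
STILL_OPEN_CUTOFF = date(2020, 12, 15)


def estimate_open_close(review_dates):
    """Estimate (opening date, closing date, still open?) from review dates."""
    if not review_dates:
        # Without reviews there is nothing to base an estimate on
        return None, None, None
    opened = min(review_dates)   # oldest review
    closed = max(review_dates)   # most recent review
    still_open = closed >= STILL_OPEN_CUTOFF
    return opened, closed, still_open


# Hypothetical review dates for one restaurant
dates = [date(2015, 3, 1), date(2018, 7, 12), date(2019, 11, 30)]
print(estimate_open_close(dates))
```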

Due to COVID, there were many restaurants that had to close down, as soon as the Danish government imposed a lockdown in Denmark. That forced all non-essential businesses to close down temporarily. Some restaurants were able to still operate, while only allowing take-away.

Edge cases when the restaurants have no reviews

Since not all the restaurants have reviews, it will not be possible to estimate the opening/closing dates for all of them based on the dates of reviews. Since these columns will be used later on, the simplest way is to drop these rows. Below is an example of 3 restaurants containing no ratings, no reviews and therefore no min_ratingDate or max_ratingDate.

In total there are 16 pizza restaurants that contain no English reviews.

Since there is very little information about these restaurants, and no reviews, it might be most beneficial to simply drop them from the analysis. Another way would have been to hard code their opening/closing dates as the beginning and end of the review period, indicating that the restaurants would have been operational from 2007 until 2021. However, it seems highly unlikely that the restaurants would be open for such a long time without receiving a single English review. Therefore it seems more representative to simply remove the restaurants that have no ratings.

So the final pizza dataset has 289 restaurants.

Conclusion: The opening/closing date of each restaurant has been estimated and restaurants with no English reviews dropped from the analysis.

img

2.6 Estimators for popularity 🌟

Mission: Create a simple estimator of popularity that can be used in further analysis later on.

In order to create simple estimators of popularity, the number of reviews, per day open, was generated for each restaurant. Since the English reviews are the only reviews used in this analysis, the main estimator will be:

However for comparison another popularity estimator will be generated from the total number of reviews:

In order to account for the last day of the opening period, a +1 was added to the period. Otherwise the restaurants that only have a single review, would be counted as having been open zero days.
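The calculation, including the +1 for the last day, can be sketched as follows. The function name `reviews_per_day` and the example dates are hypothetical; the dates stand for the oldest and newest review of a restaurant.

```python
from datetime import date


def reviews_per_day(n_reviews, first_review, last_review):
    """Number of reviews divided by the estimated number of days open."""
    # +1 so that the last day counts and a single review gives one day open
    days_open = (last_review - first_review).days + 1
    return n_reviews / days_open


# Ten reviews spread over the first ten days of January
print(reviews_per_day(10, date(2020, 1, 1), date(2020, 1, 10)))

# A restaurant whose reviews all fall on the same day
print(reviews_per_day(1, date(2020, 1, 1), date(2020, 1, 1)))
```

Note that if the review count and the dates come from different sets of reviews, for example all reviews counted against the dates of a single English review, the estimate can become very large.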

The first step is to calculate the number of reviews per day given all the languages of the reviews.

Then the same metric is created using the number of English reviews, and the popularity calculated from that.

Now there are some new columns in the dataset that contain the popularity information. Let's visualize those.

From the two histograms above it can clearly be seen that the majority of restaurants receive very few reviews per day. However, there is clearly an outlier when it comes to nrReviews_per_day. When examining the data, it can be seen that the restaurant Frankies Pizza Frederiksberg has 15 reviews, but only one of them was in English. Therefore the opening and closing dates are estimated to be the same, and all the reviews are counted as having happened on this single date. So for the popularity, it seems better to stick to the information from the English reviews.

Conclusion: We have now generated simple estimators of popularity for the restaurants, and concluded that it is best for our analysis to focus on the nrReviews_per_day_EN popularity estimator.


3. Market analysis 📈

Now that the data has been prepared, it is possible to start the market analysis of the Pizza market in Copenhagen.

img

3.1 Cousine Types 🍽

Mission: Find out if the restaurants are 'advertising' themselves as pizza places, or if pizzas are simply on the menu, while the main cuisine type is something else.

It's interesting to note that restaurants that offer pizzas, don't necessarily advertise themselves especially as pizzerias. When the CousineTypes of the restaurants are plotted in a bar chart, it can be seen that around half of the dataset is categorized as Italian or Pizza, while the rest has a broad variety of different categories.

Now there is a special column in the dataset with a binary classification of whether or not pizza is the main cuisine of the restaurant.

From the above summary table we can see that the pizzerias have a slightly higher average rating than the other restaurants. They are less expensive and have a slightly lower average sentiment. They have on average much fewer English reviews, but are however much more popular than the other restaurants, as they have a much higher number of English reviews per day.

Conclusion: Around half of the restaurants in the dataset are specifically categorized as pizza places, while the other half have a different main focus but still offer pizza on the menu. So while the main competitors would be the restaurants classified as Pizzerias, there are still a lot of other restaurants that also offer pizzas without categorizing themselves as Pizzerias. The pizzerias are more popular in terms of English reviews per day, but still have fewer reviews in total than the other restaurants. The sentiment is lower for the Pizzerias but the average rating is higher. The pizzerias tend to be in a cheaper price category than the other restaurants.

3.2 Where are the pizza restaurants located? 💥

Mission: Visualize where the pizza restaurants are located within the city.

Now it is possible to generate a plot of Copenhagen, showing where the pizza restaurants are located.

In the figure above, all of the restaurants in the dataset have been plotted on top of the different areas of Copenhagen. The black dots are restaurants that don't have Pizzas as their main Cousine Type, while the white stars are those restaurants that self-identify as Italian or Pizzerias.

Conclusion: The distribution of both pizzerias and restaurants that offer pizzas, is mainly focused around the center of Copenhagen. The majority is in Indre By, and close to the edges of the neighbouring areas, mainly, Vesterbro, Nørrebro and Østerbro. There seem to be quite a few designated pizza places in Bispebjerg, compared to the other areas that are further from the center, that usually only have a single one, or just restaurants that offer pizzas.

3.3 Differences between areas (neighbourhoods) 🏙

Mission: Analyze the differences between the CPH areas, with respect to the restaurants.

Let's further examine the difference between the areas (neighbourhoods) with respect to the pizza places.

The coloring of the cells tends not to show up when the Jupyter notebook is loaded again, so here is a photo:

img

But let's look at this again without the areas that have very few places.

The coloring of the cells tends not to show up when the Jupyter notebook is loaded again, so here is a photo:

img

Here the comparison is slightly more "fair" as these areas all have a decent number of places.

Now let's look at each area as a whole.

Visualization of the differences - Popularity

Now the data for the main six areas will be visualized using the shapefiles, to give a clearer picture of the differences between the areas. Since the table above gives a good overview of all the different values, the visualization will only be done for the popularity measure.

Here we can see visually which areas were investigated in more detail. It is clear that Østerbro and Vesterbro are the most popular areas, with Frederiksberg being the least popular of those.

Conclusion: The different areas of Copenhagen display different characteristics, and depending on what type of Pizzeria the clients want to open, the recommendations could be different. However, as a general guideline we would recommend looking first at opening a restaurant in Vesterbro.

Vesterbro has the lowest average rating and the second lowest sentiment, but is still very popular. It is the area with by far the lowest ratio of Pizzerias, while still having a high number of restaurants. It's less expensive than Indre By and Østerbro, but far more expensive than Nørrebro.

The client is interested in opening up a new Pizza place. If this place were positioned in Vesterbro, it would be surrounded mainly by restaurants that serve pizzas but are not classified as Pizzerias. As shown in 3.1 Cousine Types the popularity of pizzerias is much higher than the popularity of the other types of restaurants, so we would be positioning ourselves in a really popular area that has a low ratio of pizzerias and low average ratings and sentiment values. By opening up a great new pizza place that generates a high sentiment and a high average rating, the place would stand out from the average crowd of other restaurants. We would also be able to position ourselves in a fairly high Price Category, but would have the option of offering cheaper food, in the hope of the sentiment and reviews being higher, given that the food is great value for the price.

3.4 Evolution of distribution of pizza places 📊

Mission: Use the open/close dates and plot a heatmap with timeseries, showing locations of restaurants that are open in each year, from 2007-2021. This will show how the distribution of restaurants changes and perhaps show some trends.

Now the dataset contains estimated information about when each restaurant opened and closed. From that it is possible to generate a heat map, showing which restaurants were open in each year, and where they were located. For this we will use Folium.

The plot starts with just a single restaurant in 2007, but in 2008 there appears to be a hotspot starting in Indre By, with a few places opening up in the neighbouring areas, like Frederiksberg and Vesterbro. In 2009 the hotspot in Indre By grows, and additional places pop up in the neighbouring areas. The center hotspot keeps growing until 2013, when it has connected to the neighbouring areas, Vesterbro, Frederiksberg, Nørrebro and the southernmost areas of Amager. Until 2016 the central hotspot expands north-west, more towards Nørrebro, while also reaching slightly more towards Østerbro. Until 2019, the graph changes very little, with some areas shifting slightly but the hotspot covering a very similar area as before. From 2019-2020 it seems that many of the restaurants in the outskirts have closed down, as there seem to be fewer outliers on the map. Finally, in 2021 only a small part of the restaurants appear on the map because of COVID.

Conclusion: Most of the restaurants are and have been concentrated around the center. Some restaurants further from the center pop up every now and then, but it seems that many of them closed down in 2020, or simply cater more towards locals instead of tourists, which would explain why they are less prominent in the dataset. The business for restaurants offering pizzas still seems to be thriving, given the number of those places. But these places took a major hit during COVID, which can partly be explained by the lockdown in Denmark, which only allowed the places to offer take-away. Another reason for the lack of restaurant reviews in 2021 is that the dataset only contains English reviews. Because of COVID, the majority of customers during this period must have been Danish, and their reviews would most likely be written in Danish.

3.5 Restaurant density in areas 👩‍🍳

Mission: Analyse the density of restaurants for each area.

So which areas contain the highest density of pizza restaurants, and how large are the differences between areas?

Let's add the population data for 2021 Q2 for each area, found here. For Frederiksberg the population data for 2021 Q1 was found here.

Now we have added the population data for each area from the above sources.

Then we can calculate the ratios and display them with the shapefiles.
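The two ratios can be sketched as below. The units (restaurants per km^2 and per 1,000 inhabitants) are our assumption for readability, and the function name `density` and the example numbers are made up rather than taken from the dataset.

```python
def density(n_restaurants, area_m2, population):
    """Return (restaurants per km^2, restaurants per 1,000 inhabitants)."""
    per_km2 = n_restaurants / (area_m2 / 1_000_000)
    per_1000_inhabitants = n_restaurants * 1000 / population
    return per_km2, per_1000_inhabitants


# Hypothetical area: 50 restaurants, 10 km^2, 50,000 inhabitants
per_km2, per_1000 = density(50, 10_000_000, 50_000)
print(per_km2, per_1000)
```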

The plot above shows the density of restaurants given the area (m^2) of the neighbourhoods. Here it is clear that Indre By has by far the highest density, followed by Nørrebro and then Vesterbro. The other areas are much lower in density.

Not surprisingly, the density of restaurants relative to the population of the area is also highest in Indre By. A lot of people live in Indre By and the restaurant selection there is the largest. The same areas as before come in second and third place, namely Vesterbro and Nørrebro.

Conclusions: The density is by far highest in Indre By both given area(m^2) and population. Vesterbro and Nørrebro come in second and third place.

3.6 Restaurant count development 🕵️‍♀️

Mission: See the development of the number of pizza places staying open in each area.

So where are new places opening up or closing down in each year? In order to examine this, the number of open places for each year was computed, and the data grouped by area. In order for the graphs to be clearer, the areas containing fewer than six places are excluded.
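A minimal sketch of the counting, assuming a place counts as open in every calendar year from its estimated opening year to its estimated closing year, inclusive. The field names `area`, `opened` and `closed`, the function names and the example data are hypothetical.

```python
def open_counts(restaurants, years):
    """Count open restaurants per area and year: {area: {year: count}}."""
    years = list(years)
    counts = {}
    for r in restaurants:
        per_year = counts.setdefault(r["area"], {y: 0 for y in years})
        for y in years:
            # Open in year y if y falls within the estimated open period
            if r["opened"] <= y <= r["closed"]:
                per_year[y] += 1
    return counts


def yearly_change(series):
    """Difference in the count compared to the previous year."""
    years = sorted(series)
    return {y: series[y] - series[prev] for prev, y in zip(years, years[1:])}


# Hypothetical restaurants with estimated opening and closing years
restaurants = [
    {"area": "Area X", "opened": 2008, "closed": 2010},
    {"area": "Area X", "opened": 2009, "closed": 2009},
]
counts = open_counts(restaurants, range(2007, 2012))
print(counts["Area X"])
print(yearly_change(counts["Area X"]))
```

The second function gives the year-over-year differences, which makes openings and closures easier to spot than the raw counts.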

From the figure above we can see that Indre By has both the highest number of open restaurants at any given time and the steepest slope, that is, the largest increase in open restaurants each year. It's also the area that has suffered the highest number of closures between years after COVID hit. Vesterbro follows a similar curve to Indre By, just to a lower degree. The rest of the areas follow a fairly similar curve, with the exception that the decline in Amager Vest started in 2016, but in 2018-2019 for the other areas.

Let's examine the differences in the number each year.

If we look at the differences between the number of places open each year, some interesting patterns emerge. Most of the areas have a fairly stable number of increases/decreases between the years, but the values for Indre By fluctuate quite a lot between different years. Let's keep in mind that they are almost always increasing, but they increase much less in the years 2011 and 2013, compared to 2009, 2012 and 2014. The "closures" because of COVID, starting in 2020 and continuing in 2021, greatly affect the scale of the image, so let's exclude 2021 and draw the figure again.

Now we can see even more clearly the fluctuation in the number of places opening up in Indre By. It is clear that the pizza market was growing and expanding from 2007 until 2017, with a higher number of open places each year compared to the year before. Things start taking a turn for the worse after 2017, when the number of restaurants stays the same for two areas, Frederiksberg and Østerbro, and decreases for Amager Vest and Nørrebro. In 2019 there were decreases in the number of restaurants in four areas, compared to the year before. Those areas are Amager Vest, Frederiksberg, Indre By and Vesterbro. Østerbro has the same number of restaurants as the year before.

Conclusions: The pizza market has expanded considerably, with Indre By leading the way and Vesterbro following closely behind. The market was expanding rapidly from 2007-2017, but after that the slope declines and in some cases the number of open pizza places decreases from year to year. The industry took a big hit because of COVID, starting in 2020. However, there was already a decline before COVID, because the restaurants that were open at the start of 2020 count as having been open the entire year of 2020. Because of the market conditions and uncertainty, there might also have been fewer new places opening up than usual. Given that the financing of the client's new Pizza place is secure, there are definitely some opportunities in the market now, with fewer competitors than normal.

3.7 TF-IDF of review scores - What matters for a high review? 🥇

Mission: Use TF-IDF to analyse what words are most distinctive for different review ratings.

Here we will use the pizza dataset and group the reviews having the same score together.

Now the Term Frequency distribution of each review score can be used to generate both a normalized Term Frequency distribution, and the Term Frequency - Inverse Document Frequency distribution for each rating.
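As an illustration, TF-IDF can be computed from the grouped term frequencies roughly like this. The sketch assumes the common weighting of normalized term frequency times log(N / document frequency), where each rating is treated as one document; the exact variant used in the notebook may differ. The function name `tf_idf` and the example counts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(doc_tfs):
    """doc_tfs maps a document key (here a rating) to a Counter of term counts."""
    n_docs = len(doc_tfs)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tf in doc_tfs.values():
        doc_freq.update(tf.keys())

    result = {}
    for key, tf in doc_tfs.items():
        total = sum(tf.values())
        result[key] = {
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in tf.items()
        }
    return result


# Hypothetical term counts for reviews with rating 1 and rating 5
doc_tfs = {
    1: Counter({"rude": 2, "pizza": 2}),
    5: Counter({"amazing": 3, "pizza": 1}),
}
scores = tf_idf(doc_tfs)
print(scores)
```

A word such as pizza that appears for every rating gets a score of zero, so only the words that are distinctive for a rating stand out.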

Now the TF-IDF scores are ready to be used to compute wordclouds for each review score.

Conclusion: Similarly to what was analyzed in Business Question 1 the high review scores indicate that some important features of the restaurants scoring well are related to the words fantastic, amazing, selection, authentic and enjoyed. The lower scores more distinctly have words with a more negative meaning, such as worst, rude, joke, pretending and confused.

Mission: Use the popularity estimates and TF-IDF analysis to figure out what the popular restaurants are doing differently.

Similarly to what was done before, a TF-IDF can be generated for each restaurant. It will then show what words are most distinctive and common for each restaurant. This will show what makes each restaurant different from the rest, and can help us gain an understanding of what is needed to open up a popular restaurant.

From the wordclouds above we notice some interesting things. First of all we spot that Neighbourhood appears twice. This is explained further below. If we focus first on Mother we can see that some of the most distinctive words there are brunch, packing, district. It seems that Mother offers a popular brunch, and is located in the Meat Packing District in Vesterbro. From Mother's homepage we can find the following about their brunch:

img

If we shift our focus to Neighbourhood the main words there are cocktail and organic. These are both things that the company puts emphasis on in their advertising, as can be seen in the Google search results

img

Pizzeria Mamemi & Wine Bar has a big emphasis on wine and westmarket. The name includes Wine Bar so clearly they put emphasis on that in their own marketing and offerings. westmarket relates to the location, as the place was located close to that food court. Bæst has the main focus on its own name, followed by charcuterie, organic and mozzarella. When looking at their homepage we came across this information:

img

so it seems fitting that these words would be prominent in the wordclouds. Trattoria Fiat has more general words, but we can see that some of the main ones are italian, courtyard and atmosphere. On their Facebook page, they put a lot of focus on showing how great the atmosphere is in their courtyard.

img

There are two Neighbourhood places, one located in Jaegersborggade (Nørrebro) and the other one in Istedgade (Vesterbro). But because of how the scraping was performed, there is no way to distinguish which reviews are for which location. So both restaurants will have the same features even if the main location is clearly in Vesterbro. This is something that could be improved in the way the data was scraped, if the project were to be repeated.

Let's look at the locations of the most popular places.

Conclusions: The most popular pizzerias are located in Indre By and Vesterbro. All of the top 5 most popular places make sure to distinguish themselves from their competitors by offering something unique. In some cases it can be great cocktails, in others great wine, or a unique outdoor area. Some put more emphasis on the ingredients being organic or even homemade. But it is clear that each one made sure to state clearly what they wanted to be best known for.

3.9 Network Analysis of the Pizza Market 🍕

Mission: Analyse the network of the pizza market, i.e. the relationship between reviewers and restaurants. Analyse which places and reviewers can be grouped together into communities, and find out which users have the most influence.

The first thing added here is additional information about the reviewers. Let's call them users, to avoid confusion with the reviews themselves. The dataset contains information about the location of each user, when they joined, their number of contributions, reviews, upvotes and followers, and which other users they follow.

Then we generate a list of all the reviewers and combine it with all the restaurants to build a network showing which restaurants each reviewer has reviewed. Each reviewer and each restaurant is a node, and the rating a user has given a restaurant is an edge. Both the nodes and the edges have attributes that come from the datasets we have generated before.
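The construction described above can be sketched as follows. This is a minimal illustration, assuming the `networkx` library and a hypothetical list of review records; the record layout, the `build_review_graph` helper and the attribute names `kind` and `rating` are assumptions for the example, not the notebook's actual code.

```python
import networkx as nx

# Hypothetical review records: (user, restaurant, rating, is_pizzeria).
reviews = [
    ("user_a", "Pizzeria X", 5, True),
    ("user_b", "Pizzeria X", 4, True),
    ("user_b", "Cafe Y", 3, False),
    ("user_c", "Cafe Y", 5, False),
]

def build_review_graph(records):
    """Build a user-restaurant graph with the rating stored on each edge."""
    graph = nx.Graph()
    for user, restaurant, rating, is_pizzeria in records:
        # The node attribute 'kind' lets us colour nodes by type when plotting.
        graph.add_node(user, kind="user")
        graph.add_node(restaurant, kind="pizzeria" if is_pizzeria else "restaurant")
        # One edge per review, carrying the rating as an edge attribute.
        graph.add_edge(user, restaurant, rating=rating)
    return graph

G = build_review_graph(reviews)
print(G.number_of_nodes(), G.number_of_edges())
```

The sketch assumes that user names and restaurant names never coincide; if they could, the node identifiers would need a prefix to keep the two kinds apart.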

The graph above shows the interaction between all the restaurants in the dataset (both pizzerias and non-pizzerias) and the users that have rated each place. Pizzerias are colored red, non-pizzerias green and users blue.

The network shows that the majority of the reviewers have only rated a single restaurant, which causes the clusters of reviewers close to each restaurant. This is not surprising, as the dataset only contains English reviews, which are usually written by tourists, while natives tend to write their reviews in Danish. But there are some cases where reviewers have rated multiple locations. For instance, there are two green nodes (non-pizzerias) in the top left, with a small cluster of reviewers in the area between them.

Network of only Pizzerias

Let's now create a similar network that only displays the pizzerias and the users that have rated them.

The pizzerias and their reviewers are plotted on a graph, showing only the giant connected component (GCC) so that the plot shows up clearly. The restaurants are in orange and the reviewers are in blue (hopefully the distinction is also clear for colour-blind readers). The names of the top 10 nodes based on degree centrality are shown. Those include 9 pizzerias and 1 reviewer. The figure shows that there are certain clusters of users that mainly rate the most popular restaurants. The graph also shows that the user "elvirasandberg" has rated a lot of restaurants, as this user's degree centrality is very high. This indicates that she is a big pizzeria fan.
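Extracting the largest connected component and ranking nodes by degree centrality can be sketched as below. This is an illustrative example on a hypothetical toy network, assuming `networkx`; the node names are made up and the notebook's own code may differ.

```python
import networkx as nx

# Hypothetical toy network: two pizzerias share a reviewer, a third is disconnected.
G = nx.Graph()
G.add_edges_from([
    ("u1", "Pizzeria A"), ("u2", "Pizzeria A"), ("u4", "Pizzeria A"),
    ("u2", "Pizzeria B"), ("u3", "Pizzeria B"),
    ("u5", "Pizzeria C"),  # separate component, dropped below
])

# Keep only the largest connected component.
largest = max(nx.connected_components(G), key=len)
gcc = G.subgraph(largest).copy()

# Degree centrality = degree / (number of nodes - 1), computed within the component.
centrality = nx.degree_centrality(gcc)
top_nodes = sorted(centrality, key=centrality.get, reverse=True)[:10]
print(top_nodes[0], centrality[top_nodes[0]])
```

In a reviewer-restaurant network, a reviewer only reaches a high degree centrality by rating many places, which is why a single very active user can appear among the top nodes alongside the popular restaurants.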

Indeed, she has rated a total of 17 of the pizzerias in Copenhagen.

Community detection in the network

Let's try to apply a community detection algorithm to the network, and see if the users and pizzerias can be grouped together into different communities. Applying the algorithm to the dataset detects 12 communities.
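As an illustration of what such a step can look like, the sketch below runs a modularity-based method on a hypothetical toy network. The text does not say which algorithm was used for the 12 communities; greedy modularity maximisation from `networkx` is used here purely as an assumption, and the node names are made up.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical toy network: two pizzerias, each with its own reviewers,
# linked only through one user (u3) who reviewed both.
G = nx.Graph()
G.add_edges_from([("Pizzeria A", "u1"), ("Pizzeria A", "u2"), ("Pizzeria A", "u3")])
G.add_edges_from([("Pizzeria B", "u4"), ("Pizzeria B", "u5"), ("Pizzeria B", "u6")])
G.add_edge("u3", "Pizzeria B")

# Returns a list of frozensets of nodes, largest community first.
communities = greedy_modularity_communities(G)
print(len(communities))
for members in communities:
    print(sorted(members))
```

Reviewers who rated a single place end up in the same community as that place, while a user who bridges two places is assigned to one side or the other.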

Now a small helper function is defined to find the community number of a node.
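A helper of this kind could look like the sketch below, assuming the communities are stored as a list of node sets; the function name `community_of` and the example partition are hypothetical.

```python
def community_of(node, communities):
    """Return the index of the community containing `node`, or None if absent."""
    for index, members in enumerate(communities):
        if node in members:
            return index
    return None

# Hypothetical partition for illustration.
example_communities = [{"Pizzeria A", "u1"}, {"Pizzeria B", "u2", "u3"}]
print(community_of("u2", example_communities))
```

The returned index can then be mapped to a colour when plotting, so that all nodes in the same community share one colour.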

Then we plot the network, where each community is represented by a specific color.

From the graph above we can see the different communities represented by different colours (apologies to colour-blind readers, we did not find a better way to present this 😬). The algorithm groups restaurants and users together based on similarities in the network. The purple community consists of Bæst, Mother and the users close to them. The blue community consists of Trattoria Fiat, Mama Rosa and the reviewers close to them. Neighbourhood is located in the center in green, with its users spreading out through the network. The orange community consists of Pizzeria Mamemi & Wine Bar along with users such as elvirasandberg. In some instances the positions of the nodes seem to correspond to their communities, with the orange, red and blue communities being particularly clear in the figure.

The most influential reviewers

So can we use this information to determine which reviewers are most influential, to figure out which users should be invited to the grand opening of our client's new pizza place?

In the graph above, we plot the network of restaurants and users, with the colors representing the community of each node. For the restaurants, the size of a node corresponds to its number of reviews; for the users, the size is given by the number of followers multiplied by the number of upvotes the user has received. Some of the biggest restaurants (with respect to degree centrality) are labelled, along with the most influential users. Let's examine the most influential users further by printing out their information.
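The user sizing rule described above (followers multiplied by upvotes) can be sketched as a simple score. The user records and field names below are hypothetical and only illustrate the arithmetic.

```python
# Hypothetical user records; the field names are assumptions for this example.
users = [
    {"name": "user_a", "followers": 3, "upvotes": 40},
    {"name": "user_b", "followers": 0, "upvotes": 120},
    {"name": "user_c", "followers": 10, "upvotes": 5},
]

def influence_score(user):
    """Score a user as followers multiplied by upvotes."""
    return user["followers"] * user["upvotes"]

# Rank users from most to least influential under this score.
ranked = sorted(users, key=influence_score, reverse=True)
for user in ranked:
    print(user["name"], influence_score(user))
```

Because the score is a product, a user with no followers scores zero regardless of upvotes. A very active reviewer can therefore be invisible under this measure, which is worth keeping in mind when reading the ranking.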

From the above we can see that the most "influential" reviewers are not really that influential in the graph, since they have few followers, even if they have a decent number of upvotes. Since this is only based on reviews in English, this is not surprising, as the majority of the reviews in Denmark could easily be in Danish. However, our recommendation would be to invite the users that live in Copenhagen, Mer21maid and kmpuggaard, along with JoseF98, who lives in Malmö, Sweden. That way we are inviting three users that are each part of a different community (7, 1 and 2), so their influence will be maximized, as they did not have enough in common to be classified as part of the same community. Since elvirasandberg has rated 17 pizza restaurants, it might also be a good idea to extend an invite to her, even though she has no upvotes or followers. Given the number of reviews, it is highly likely that she lives in the area, even if this is not specified.

The effect of inviting four selected users to try the new place

So what effect might we have by inviting these four selected users to the grand opening? These users are part of three separate communities: communities 1, 2 and 7. Let's start by looking at which places and users are part of these communities.

From the above we can see that the network mostly consists of restaurants (bigger nodes) along with their reviewers. Some reviewers have no influence and no upvotes, and those are shown as tiny nodes with no labels. By inviting those four users we have the potential to influence these 3 communities, as other users within the communities might be influenced by our chosen users. The main player in the network is Pizzeria Mamemi & Wine Bar, which, as we have previously seen, is one of the most popular pizzerias in the market. Since our recommendation is to open up a place in Vesterbro, let's examine which places in Vesterbro are part of this network.

These two pizzerias would be the main players affected if our approach proved successful. They have different profiles, one being extremely popular while the other is less so. The popular place has a very low price category, so we might decide not to position ourselves too closely to them price-wise, as we have seen previously that the area's average price category is much higher.

Conclusions: It's difficult to use such sparse networks to capture the true influence of users and find out which ones would be best to invite. The restaurants and reviewers were split up into 12 communities, each with some similarities between its nodes and dissimilarities to the nodes in other communities. From those 12 communities we identified four users that should be invited to the grand opening, coming from three communities. These results all depend on the English reviews, and it would be good to extend this analysis to the Danish reviews, in order to hopefully get a denser network, or at least a network with more influential nodes.

green-divider

Conclusions and Discussions

Let's circle back to the market questions we had to answer in order to be able to answer the business question:

Where should we open up a new Pizza place, who should we invite to the opening and what should we keep in mind?


Summary: The pizza market consists of much more than only pizzerias. The market expanded until 2017 but has been in some decline over the last few years. There are opportunities in the market now after COVID, and the best area to open up a dedicated pizzeria is estimated to be Vesterbro. Four users should be invited to the opening: Mer21maid, kmpuggaard, JoseF98 and elvirasandberg. The clients need to distinguish themselves from other market players by coming up with something unique to offer, either in food or in experience.

What could be improved

Thank you for reading all the way to the end 🥳

img